ABSTRACT

The UniCarb KnowledgeBase (UniCarbKB;
http://unicarbkb.org) offers public access to a growing,
curated database of information on the glycan 
structures of glycoproteins. UniCarbKB is an inter- 
national effort that aims to further our understand- 
ing of structures, pathways and networks involved 
in glycosylation and glyco-mediated processes by 
integrating structural, experimental and functional 
glycoscience information. This initiative builds 
upon the success of the glycan structure database 
GlycoSuiteDB, together with the informatic stand- 
ards introduced by EUROCarbDB, to provide 
a high-quality and updated resource to support 
glycomics and glycoproteomics research. 
UniCarbKB provides comprehensive information 
concerning glycan structures, and published glyco- 
protein information including global and site- 
specific attachment information. For the first 
release over 890 references, 3740 glycan structure 
entries and 400 glycoproteins have been curated. 
Further, 598 protein glycosylation sites have been 
annotated with experimentally confirmed glycan 
structures from the literature. Among these are 35 
glycoproteins, 502 structures and 60 publications 
previously not included in GlycoSuiteDB. This 
article provides an update on the transformation of 
GlycoSuiteDB (featured in previous NAR Database 
issues and hosted by ExPASy since 2009) to 
UniCarbKB and its integration with UniProtKB and 
GlycoMod. Here, we introduce a refactored 
database, supported by substantial new curated 
data collections and intuitive user-interfaces that 
improve database searching. 



INTRODUCTION 

Protein glycosylation is an important and universal post- 
translational modification that is estimated to occur on 
between 20% and 50% (1,2) of all secreted and cellular 
proteins. Glycoproteins are characterized by the presence 
of oligosaccharides linked to the peptide backbone 
through N- or O-glycosidic bonds at asparagine or
serine/threonine residues, respectively. For both N- and
O-glycosylation, there can be considerable diversity of 
glycan structures associated with each glycosylation site. 
Such micro-heterogeneity is governed by an elaborate 
process carried out by numerous intricate and competitive 
steps, which result in the generation of tissue and cell type- 
specific glycan expression patterns. 

Given that protein glycosylation is involved in 
numerous cellular processes and is implicated in disease 
progression (3-5), the ability to accurately characterize
glycan structures (in a global and site-specific manner)
and to identify the modified proteins is increasingly
important in functional glycomics (6-8). The molecular
and functional complexity of glycoproteins is
challenging and requires sustainable bioinformatic re- 
sources aimed at capturing, integrating and maintaining 
the available knowledge. The more complete our under-
standing of glycosylation is, the better equipped we will be
to understand the functional and structural roles of both 
glycoproteins and the attached glycans at the molecular 
level. Unfortunately, and despite the success of several
international initiatives, the glycosciences still lack a
managed infrastructure that contributes to the advance- 
ment of research through the provision of comprehensive 
structural and experimental glycan data collections. 
As described by the US National Academy of Sciences 
report 'Transforming Glycoscience: A Roadmap for the 
Future' (9), an important factor in broadening the appreciation
of glycomics is the necessity to develop robust,
scalable and standardized bioinformatic platforms to
acquire and disseminate the information-rich data collections
that are becoming increasingly available.



OVERVIEW OF UniCarbKB 

UniCarbKB is an initiative that aims to promote the 
creation of a curated, glycan structure-based, online infor-
mation storage and search platform for glycoscience
research (10). This initiative builds upon the previously 
successful databases, the Australian-developed
GlycoSuiteDB (11) and the EU-funded EUROCarbDB 
(12), to offer a freely accessible updated platform built 
with modern front and back-end technologies. The re- 
engineered framework offers an intuitive user-interface 
with enhanced features and greater support for on-going 
international efforts to establish common data standards
that will better integrate structural, experimental and 
functional data collections. 

GlycoSuiteDB 

GlycoSuiteDB is acknowledged as the first effort to pro- 
vide detailed non-redundant, curated structural informa-
tion derived from the published literature on conjugated
glycans. The database connects glycan structure and bio- 
logical origin with protein specific information where 
known. For each glycan structure (e.g. glycan type, mass 
and composition), detailed information is provided on the 
native and recombinant sources (i.e. tissue and/or cell 
type, cell line, strain and disease state), with appropriate 
links to Swiss-Prot/TrEMBL entries, a record of the
methods used to determine the structure, and the
PubMed ID of the cited publication. The design objectives
and functionality of GlycoSuiteDB have been published
in previous NAR database issues (11,13). Originally de- 
veloped commercially, it has since been made available 
publicly through the ExPASy server (14). Access to 
the content has not only preserved the efforts of the 
curation team, but is now helping to seed the 
UniCarbKB effort to build an up-to-date high-quality
resource for glycoscience. 

EUROCarbDB 

The EUROCarbDB project was a collaborative European 
design study that focused on building the foundations of 
a technical framework to support glycobioinformatic 
activities. This resulted in the provision of sophisticated 
open-source tools and structure encoding formats and 
databases that, to date, continue to support several 
facets of analytical glycomics. The architecture of the 
EUROCarbDB database started to address stumbling
blocks impeding progress in glycomics by providing the 
glycobiology community with (i) universal standards for 
the representation of monosaccharides and complex 
glycans, (ii) a freely accessible database of known glycan 
structures and experimental evidence, (iii) freely accessible 
analytical tools for researchers and (iv) a technical frame- 
work of open-source code. 



UniCarbKB: building upon the foundations laid by 
GlycoSuiteDB and EUROCarbDB 

UniCarbKB is focused on enhancing existing tools, stand- 
ards and applications to be more accessible and amenable 
to modern research workflows. In particular we have 
leveraged previous experiences to build a modern and 
scalable framework, which uses technologies and web 
frameworks that are more familiar to developers. As the 
first step we have merged the glycan structural informa- 
tion from the no longer supported GlycoSuiteDB and 
EUROCarbDB initiatives into a high-quality updated
framework. Several libraries developed by the 
EUROCarbDB initiative including GlycanBuilder (15), 
MonosaccharideDB (http://www.monosaccharidedb.org) 
and GlycoCT (16) have also been incorporated. 

DESIGN AND IMPLEMENTATION 

UniCarbKB is built with the open-source framework 
Play (Release 2.1.3) written in Java and Scala, which 
follows the model-view-controller architecture. The views 
(user-interface) are predominantly written in Scala and 
include the jQuery and Bootstrap JavaScript libraries. The
model and controller layers are written in Java and the 
Ebean object-relational mapping (ORM) library is used to 
query the underlying database model. UniCarbKB uses 
PostgreSQL (Version 9.2) as the underlying database 
system that consists of multiple schemas to ensure data 
integrity by managing structural, literature and experi- 
mental data collections. 

We have updated the front-end of GlycoSuiteDB with a 
refreshed interface that is easier to navigate and focuses on 
displaying content the researcher wants to access. The new 
update is more visual and includes a new, simpler content
layout and improved accessibility options. All pages now
display a menu bar to make it easier to access features and 
navigate around the website in a consistent way. The in-
clusion of JavaScript libraries and Bootstrap has allowed
us to enhance the user experience. For example, we have
changed the search functionality by including an
auto-completion feature; in addition, the use of pagination and
drop-down lists improves the handling and display of large
data collections. During the design process, a series of
sketched user interface wireframes and mockups were 
created, which allowed developers to consult with the 
end-user to test and refine navigation, evaluate the effect- 
iveness of page layouts and determine web development/ 
programming requirements. 

DATABASE CONTENT 

Primarily, UniCarbKB is a eukaryotic glycoprotein- 
centric resource built on the corpus of curated information 
originating from GlycoSuiteDB and a select few datasets 
from EUROCarbDB. We have expanded the content by 
manually curating over 60 more recent publications that 
contain (partially or completely characterized) glycan struc-
tures with supporting experimental data that substantially
extend the content coverage of GlycoSuiteDB. A majority 
of the newly sourced data are derived from a literature
study by Thaysen-Andersen and Packer, which sought to
correlate N-glycan structures characterized from purified
glycoproteins with protein structure (17). This comprehen-
sive dataset will contribute over 470 glycosylation sites
from over 160 mammalian glycoproteins, from different
tissues and body fluids to the over 3700 glycan structure 
entries, 400 glycoproteins and 598 protein glycosylation 
sites already curated by UniCarbKB. For each glycopro- 
tein record two levels of database annotations are 
provided: (i) site-specific data for individual glycan struc-
tures that are associated with an amino acid sequence pos-
ition and (ii) where a single purified glycoprotein has been
analysed, all characterized glycan structures are linked to
the glycoprotein accession number in UniProtKB (14). 
Also, new structural and experimental glycan data have 
been contributed from the integration of the final public 
release of GlycoBase (18) developed in conjunction with 
EUROCarbDB. 



SEARCH FEATURES 

As part of our effort to improve the overall user experi-
ence, a series of new interfaces have been implemented
(Figure 1). Many build upon and retain the features avail-
able in GlycoSuiteDB but with improved functionality.
For example, we have (i) enhanced native selects by
including a multi-select interface, (ii) made greater use of
JavaScript to load items via Ajax, which allows datasets to be
partially loaded, and (iii) added pagination to better support
the presentation and navigation of large datasets.
Supported queries include (sub)structure, monosaccharide
composition, glycan mass, taxonomy, tissue, disease,
glycoprotein (accession number or Swiss-Prot name) and
literature publication.
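
To illustrate how a composition query relates to a glycan mass query, the following minimal Python sketch computes the monoisotopic mass of a free, underivatised glycan from its monosaccharide composition and filters a small set of entries by an observed mass. The residue masses are standard textbook values rather than values taken from UniCarbKB, and the function names, entry identifiers and tolerance are hypothetical; this is not the implementation used by the database.

```python
# Monoisotopic residue masses (Da) of common monosaccharides.
# Standard values, assumed here for illustration; not taken from UniCarbKB.
RESIDUE_MASS = {
    "Hex": 162.0528,     # hexose, e.g. Man, Gal, Glc
    "HexNAc": 203.0794,  # N-acetylhexosamine, e.g. GlcNAc
    "dHex": 146.0579,    # deoxyhexose, e.g. Fuc
    "NeuAc": 291.0954,   # N-acetylneuraminic acid
}
WATER = 18.0106  # one water is added for the free reducing end


def glycan_mass(composition):
    """Return the monoisotopic mass of a free, underivatised glycan."""
    residues = sum(RESIDUE_MASS[name] * count
                   for name, count in composition.items())
    return residues + WATER


def match_by_mass(entries, observed, tolerance=0.01):
    """Return the ids of entries whose mass is within tolerance (Da)."""
    return [entry_id for entry_id, composition in entries.items()
            if abs(glycan_mass(composition) - observed) <= tolerance]


if __name__ == "__main__":
    # Hypothetical entries keyed by made-up identifiers.
    entries = {
        "man5": {"Hex": 5, "HexNAc": 2},
        "man6": {"Hex": 6, "HexNAc": 2},
    }
    print(round(glycan_mass(entries["man5"]), 4))
    print(match_by_mass(entries, 1234.43))
```

A real query would also need to account for derivatisation, reducing-end modifications and adducts, which are left out of this sketch.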

Structure searching with GlycanBuilder 

Previously, GlycoSuiteDB provided a structure interface
that consisted of textual and form-based input; however,
many researchers prefer to graphically visualize glycan
structures due to their inherent complexity. We have
incorporated GlycanBuilder (Vaadin Release) (19) into
the search functionality of UniCarbKB, which supports the
exact or partial matching of structures in the database.
The user may (i) build a new structure, (ii) extend a struc-
ture from a predefined list or (iii) build a substructure/
epitope; in all instances the anomeric configuration of a 
monosaccharide residue and linkage type can be defined.
By default an exact search will only retrieve those
database structures that perfectly match the topology,
linkage and anomeric configuration submitted. In the
case of partial searching, a level of fuzziness is introduced,
whereby unknown information is handled as wildcards by
the search algorithm. For substructure searching only
those structures that have the (extended) epitope or
motif built will be returned.

Figure 1. UniCarbKB offers a number of improved query interfaces. (A) Users can search the database content by monosaccharide composition,
attached protein, taxonomy or tissue by using an auto completion feature. (B) The integration of the latest version of GlycanBuilder allows users to
query UniCarbKB by (sub)structure searching.
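
The difference between exact, partial and substructure searching can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions: glycans are modelled as small trees, unknown anomers or linkages are written as "?", and the class and function names are hypothetical. It does not reproduce the search algorithm used by UniCarbKB or GlycanBuilder.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    name: str
    anomer: str = "?"    # "a", "b", or "?" when unknown
    linkage: str = "?"   # e.g. "1-4", or "?" when unknown
    children: list = field(default_factory=list)


def _same(a, b, wildcard):
    # In partial mode an unknown value ("?") matches anything.
    if wildcard and "?" in (a, b):
        return True
    return a == b


def _assign(q_children, t_children, wildcard, whole):
    # Backtracking: map every query branch to a distinct target branch.
    if not q_children:
        return True
    first, rest = q_children[0], q_children[1:]
    for i, target in enumerate(t_children):
        remaining = t_children[:i] + t_children[i + 1:]
        if matches(first, target, wildcard, whole) and \
                _assign(rest, remaining, wildcard, whole):
            return True
    return False


def matches(query, target, wildcard=False, whole=True):
    """Exact search (wildcard=False) or partial search (wildcard=True)."""
    if query.name != target.name:
        return False
    if not _same(query.anomer, target.anomer, wildcard):
        return False
    if not _same(query.linkage, target.linkage, wildcard):
        return False
    if whole and len(query.children) != len(target.children):
        return False
    return _assign(query.children, target.children, wildcard, whole)


def contains(query, target, wildcard=True):
    """Substructure search: query occurs at target or at any descendant."""
    if matches(query, target, wildcard, whole=False):
        return True
    return any(contains(query, child, wildcard) for child in target.children)


if __name__ == "__main__":
    core = Residue("GlcNAc", "b", "1-4", [
        Residue("Man", "b", "1-4", [
            Residue("Man", "a", "1-3"),
            Residue("Man", "a", "1-6"),
        ]),
    ])
    epitope = Residue("Man", "a", "1-3")
    print(matches(core, core), contains(epitope, core))
```

Here an exact search requires the same topology, linkage and anomeric configuration at every residue, a partial search relaxes only the positions marked "?", and a substructure search accepts targets that carry additional branches around the query motif.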

Glycan structure encoding 

A plethora of graphical and textual formats are available 
for the depiction of glycan structures including: the 
Consortium for Functional Glycomics/Essentials, the 
Oxford nomenclature and IUPAC formats. Historically,
GlycoSuiteDB encoded glycan structures in an IUPAC
style format; however, recent databases have adopted con-
nection table approaches exemplified by GlycoCT and
KCF (KEGG Chemical Function) (20) to describe oligo-
saccharide sequences with a controlled vocabulary. By ex-
tension of these efforts, UniCarbKB supports the storage
of GlycoCT and IUPAC formats, and to further extend
database interoperability an IUPAC to KCF and a
modified IUPAC to GlycoCT translator have recently
been developed to complement existing translators.
Similar to GlycomeDB and EUROCarbDB we have im- 
plemented a feature that enables users to switch between 
supported graphical formats. This feature is made possible 
by integrating the GlycanBuilder API, which produces 
high-quality representations of glycan structures. 

METADATA EXTENSIONS 

To enable users of UniCarbKB to assess the reliability of
the contained information, provenance metadata must be 
recorded. Provenance metadata relates to the origin of the 
data and deals less with the finer details and more with the 
process of how the data came to be. 

Biological source 

The biological context module, developed by 
EUROCarbDB, handles the association of structure to 
biological source that amalgamates taxonomy and tissue, 
together with a varying number of disease and perturb-
ation associations (Figure 2). The library adopts the
controlled vocabularies derived from the NCBI 
Taxonomy and the MeSH (Medical Subject Headings) 
databases. Its inclusion reduces data redundancy by 
providing a hierarchical controlled vocabulary that links 
specific taxonomic descriptions with more generalized
terms, e.g. specific tumours or cancer of the lung are
grouped under the more general term 'Lung Neoplasms'.
This approach improves upon the disconnected terms used
in GlycoSuiteDB, by providing a more robust interface for
searching and grouping together glycan structures based
on taxonomic or disease terms.

Figure 2. Screenshot for the Coagulation Factor IX entry in UniCarbKB. The database provides a description of the glycan structures characterized
for this glycoprotein and the number of structures associated with experimentally confirmed glycosylation sites. A general description of the
glycoprotein is provided that has been derived from the relevant UniProtKB entry in addition to the protein sequence. Users have the Structure
Format option to select a preferred graphical format to display the glycan structures that includes the three most commonly used notations.
Supporting biological metadata and publication information is also provided. The Biological Associations include details pertinent to the species,
the individual protein and tissue/secretory fluid source. Finally, those References that have been manually curated to obtain this information are
summarized with appropriate links to PubMed.
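
As a minimal sketch of how a hierarchical controlled vocabulary supports this kind of grouping, the Python example below stores each term with its parent and retrieves every structure annotated with a term or with any of its more specific descendants. The term hierarchy, structure identifiers and function names are illustrative assumptions and are not copied from MeSH or from the UniCarbKB schema.

```python
# Child term -> parent term. An illustrative hierarchy only; the real
# vocabularies (NCBI Taxonomy, MeSH) are far larger and are not reproduced.
PARENT = {
    "Lung Neoplasms": "Neoplasms",
    "Breast Neoplasms": "Neoplasms",
    "Adenocarcinoma of Lung": "Lung Neoplasms",
    "Small Cell Lung Carcinoma": "Lung Neoplasms",
}


def ancestors(term):
    """Return the chain of broader terms above the given term."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def structures_for(term, annotations):
    """Return ids annotated with the term or any narrower term."""
    return sorted(structure_id
                  for structure_id, annotated in annotations.items()
                  if annotated == term or term in ancestors(annotated))


if __name__ == "__main__":
    # Hypothetical structure ids mapped to their disease annotation.
    annotations = {
        "G1": "Adenocarcinoma of Lung",
        "G2": "Small Cell Lung Carcinoma",
        "G3": "Breast Neoplasms",
    }
    print(structures_for("Lung Neoplasms", annotations))
```

Because each annotation stores only the most specific term, a query for a broader term still finds the structure, which is the redundancy reduction described above.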

Identification of methods

The reporting of descriptive metadata that is representa-
tive of the reported literature poses many challenges, but is
essential for the development of a well-documented glycan 
and glycoprotein database. Efforts led by the Minimum 
Information for A Glycomics Experiment (MIRAGE) 
(21) project and the ontology work of GlycoRDF aim to 
alleviate this situation by providing standardized data 
entry terms, therefore fulfilling one of the recommenda-
tions of the NAS Committee on Assessing the Importance
and Impact of Glycomics and Glycosciences. UniCarbKB
has started to address those standardization guidelines
proposed by MIRAGE, by establishing a high-level 
vocabulary that captures (i) the sample preparation pro-
cedures, encompassing glycan release techniques and/or
methods that alter glycan structure, including 
exoglycosidase treatment and derivatization, (ii) the 
general analytical approach and (iii) the use of comple- 
mentary validation methods such as lectin studies and 
monosaccharide analysis. This information is provided 
for all published references that have been curated and
the vocabulary is continually expanded to reflect
database content. By listing the methods used by the
authors of the publication to determine the structure, 
users can determine their own level of confidence in the 
reported structures; in particular, by assessing the suitabil- 
ity of orthogonal methods such as array platforms, capil- 
lary electrophoresis, gas chromatography, lectin-binding, 
liquid chromatography, mass spectrometry and nuclear
magnetic resonance. 

N-glycan pathways

GlycanSynth is a new feature in UniCarbKB that inte- 
grates known genes and enzymes involved in the biosyn-
thesis of N-glycans. A list of enzymes was manually
curated from the Kyoto Encyclopedia of Genes and 
Genomes (22) and GlycoGene (23) databases. Data 
related to enzyme activity, including but not limited to
glycosylation-related processes, were also catalogued
from the BRENDA (24) and UniProt databases. In 
addition, the Consortium for Functional Glycomics (25) 
and Carbohydrate-Active enzymes (26) were used as 
valuable resources for extracting glycosyltransferase 
genes and related downstream targets information. 
Furthermore, we aggregated appropriate gene informa- 
tion from the National Center for Biotechnology 
Information (NCBI).

For each catalogued protein N-glycosylation-related
gene name we constructed a broad set of disaccharide re-
actions that match the gene against a particular donor and
acceptor substrate. In total, 37 glycosyltransferases have
been documented that are involved in the synthesis of
N-glycan structures in humans stemming from the Man5
structure. A list of these gene names, enzymes and reac-
tions is available at http://unicarbkb.org/enzymes. By
using these reaction rules it is possible to (i) connect
gene function with glycan structure and (ii) validate the
accuracy of structures in a database based on implicit
knowledge of the glycosylation machinery. This will be
achieved by encoding the disaccharide sequences in the
GlycoCT condensed format or IUPAC form, and using
a tree traversal technique to assign linkage information.
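
The idea of checking a structure against disaccharide reaction rules can be sketched as follows. This Python example is a simplified illustration, not the GlycanSynth implementation: the three rules are well-known human glycosyltransferase activities chosen as examples rather than the curated list of 37 enzymes, each rule is keyed only on donor residue, linkage and acceptor residue, and the `Node` type and `audit` function are hypothetical names.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    link: str             # anomer and linkage to the parent, "" for the root
    children: tuple = ()


# (donor residue, linkage, acceptor residue) -> gene name.
# Illustrative subset only; several enzymes can share one disaccharide.
RULES = {
    ("Gal", "b1-4", "GlcNAc"): "B4GALT1",
    ("NeuAc", "a2-6", "Gal"): "ST6GAL1",
    ("Fuc", "a1-6", "GlcNAc"): "FUT8",
}


def audit(root):
    """Traverse the tree and check every parent-child link against RULES.

    Returns the set of genes implied by the structure and the list of
    links that no rule explains.
    """
    genes, unsupported = set(), []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            key = (child.name, child.link, parent.name)
            if key in RULES:
                genes.add(RULES[key])
            else:
                unsupported.append(key)
            stack.append(child)
    return genes, unsupported


if __name__ == "__main__":
    # A fragment, not a complete N-glycan: a sialylated, fucosylated GlcNAc.
    fragment = Node("GlcNAc", "", (
        Node("Gal", "b1-4", (Node("NeuAc", "a2-6"),)),
        Node("Fuc", "a1-6"),
    ))
    print(audit(fragment))
```

A link that no rule explains, such as a Gal a1-3 Gal epitope in a structure reported from human tissue, would be returned in the second list and could be flagged for curator review.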

INTERFACING UNICARBKB WITH EXTERNAL 
RESOURCES 

Following the release of GlycoSuiteDB in 2002, several
international initiatives have developed structural and ex-
perimental glycan databases, notably the CFG,
EUROCarbDB, BCSDB (27), RINGS (28) and
JCGGDB. A key component of UniCarbKB is to forge
relationships with these valuable resources. In the first
instance we have worked with the glycan MS/MS data
repository, UniCarb-DB (29) and liquid chromatography
retention data collection, GlycoBase (18) (projects that
stemmed from EUROCarbDB) to cross-reference these
databases of experimental data together through struc-
ture-based URL links in UniCarbKB. In partnership
with the Australian National Data Service we have integrated
UniCarbKB curated data collections with Research Data
Australia, a discovery platform that enhances connec-
tions between data projects, researchers and institutions
aimed at promoting the visibility of research. Also, the
GlycoMod tool (14) (hosted at ExPASy,
http://web.expasy.org/glycomod), designed to predict oligosaccharide
structures from experimentally determined masses, is
now directly linked to UniCarbKB, connecting theoretic-
ally possible compositions with curated glycan structures.
Finally, we have also made use of the UniProtJAPI Java
web service (30), which facilitates the integration of
UniProtKB data into our web application. Here, we
extract the glycoprotein description from UniProtKB for
all glycan structure entries that have an assigned protein
accession number; such information is displayed to the
user in each protein summary page (Figure 3).



FUTURE DEVELOPMENTS 

We envisage that this resource will be extended in the
future to encompass knowledge and information on all
glycoconjugates; however, due to limited resources the
emphasis initially will be placed on publications contain-
ing well characterized N- and O-linked structures and the
associated experimental data on proteins derived from eu-
karyotic organisms. UniCarbKB will be updated on a
regular basis with newly curated data collections. In the
short term, we will also enhance the functional informa- 
tion of glycans by cross-linking the SugarBind database 
(31) to UniCarbKB and target sub-structures recognized 
by lectins. 

Figure 3. Navigating from UniProtKB to UniCarbKB. An example is shown for the (A) UniProtKB alpha-2-HS-glycoprotein entry that is linked to
(B) the glycan structures on the individual amino acid sites as curated in UniCarbKB, by CAR identifiers. The curated records in the first release of
UniCarbKB are being updated and included in the sequence annotation feature records of UniProtKB.

We plan to make available a web service API this year to
support access to UniCarbKB data. By using the API
developers will be able to search against UniCarbKB and
its affiliated mass spectral-based project UniCarb-DB. In
conjunction with the GlycoRDF project we have started to 
represent our data in a standardized Resource Description 
Framework (RDF) format that will tackle the problems of 
disparate and decentralized databases by using Semantic 
Web technologies to unify content. We also plan to imple- 
ment support for new tools that utilize the growing infor- 
mation stored in UniCarbKB e.g. 'GlycoDigest' (an 
exoglycosidase digestion prediction tool in development at 
SIB) and glycan translators that will support commonly 
used encoding formats including WURCS (Web3.0 
Unique Representation of Carbohydrate Structures). To 
the best of our abilities, our development effort guarantees 
data exchange and tool compatibility (32). In the longer
term we plan to establish UniCarbKB as a structure- 
centric, high-quality glycan database from which all avail- 
able information on each glycan structure is easily 
accessible. 